We will study whether genes belonging to specific annotated gene sets, such as Gene Ontology (GO) terms, are enriched in the set of genes of interest, in this case the list of differentially expressed genes (DEGs). This analysis is commonly referred to as Over-Representation Analysis (ORA).
To conduct such an analysis, we need:
A set of genes obtained from experimental data that are biologically relevant for our research, in this case the list of DEGs. These genes must be annotated, meaning we need information about which known biological functions they are associated to. For this, we will make use of an annotation file generated through BLASTing Pelobates cultripes transcripts against the Ensembl Xenopus tropicalis proteome using diamond (v2.1.8).
A ‘background’ set of genes, which includes all genes that could potentially have been differentially expressed. This background set serves as the reference against which the enrichment of specific gene sets is assessed.
A curated database containing biologically relevant categories, such as GO terms or gene pathways, along with the genes associated with each category.
pacman::p_load(tidyverse, DESeq2, gprofiler2)
# The list of DEG results
res<-readRDS("./results/deseq2_regions_results_local.rds")
This annotation file contains all P. cultripes transcripts by rows. As columns, we can find: * The gene IDs for P. cultripes, followed by the transcripts IDs and the peptides IDs (gene_id, transcript_id, peptide_id) * The IDs and descriptions of X. tropicalis annotated proteome resulting from both nucleotide and peptide blasting against P. cultripes transcripts (xenx_pep_id, xenx_gene_symbol, xenx_description, xenp_pep_id, xenp_gene_symbol, xenp_description).
# The annotation file
xtrop<-read.csv("./xtr109/diamondblast109.csv", stringsAsFactors = FALSE)
However, this annotation file contains multiple annotations per gene. This means, in the columns corresponding to the results of BLASTing the genes, the fields (ID field or description field), we can find more than one element that is associated to one only gene.
g:Profiler is unable to process these entries, therefore we have to process the file so that it can interpret the multiple possible annotations for each gene.
The goal is to split multiple annotations within a single cell, unlist them to handle them individually and ensure uniqueness so that it is understandable for g:Profiler.
We will extract the X. tropicalis gene symbols as input for g:Profiler.
pull_genes<-function(x, alpha=0.05, lfc=0, signed="both", feature="xenp_gene_symbol") {
# filter by adjusted p
x<-x %>%
filter(padj<alpha)
# 3 modes for filtering based on log fold change, controlled by the signed parameter
if(signed=="up"){
x<-x %>%
filter(log2FoldChange>lfc) %>%
pull(feature)
} # include only genes where the log fold change is greater than a specified threshold (lfc, which defaults to 0)
if(signed=="down"){
x<-x %>%
filter(log2FoldChange<(lfc*-1)) %>%
pull(feature)
} # include only genes where the log fold change is less than the negative of the specified threshold (-lfc)
if(signed=="both"){
x<-x %>%
filter(abs(log2FoldChange)>lfc) %>%
pull(feature)
} # include genes where the absolute value of the log fold change is greater than the specified threshold (lfc), includes changes in either direction.
# deal with multiple gene annotations for the same gene (different transcripts)
x<-strsplit(x, ";") %>% unlist() %>% unique()
# separates different annotations in different rows
x<-x[!is.na(x)]
return(x[!is.na(x)])
}
# now extract all
sig_deg<-lapply(res, FUN=function(x)
x %>%
as_tibble(rownames = "gene_id") %>%
left_join(xtrop) %>%
pull_genes(feature = "xenp_pep_id", alpha = 0.05, lfc=0, signed = "both")
)
## Joining with `by = join_by(gene_id)`
## Joining with `by = join_by(gene_id)`
str(sig_deg)
## List of 2
## $ central: chr [1:106] "ENSXETP00000106396" "ENSXETP00000089706" "ENSXETP00000118428" "ENSXETP00000118788" ...
## $ south : chr [1:1065] "ENSXETP00000042154" "ENSXETP00000095923" "ENSXETP00000079038" "ENSXETP00000092115" ...
We will use the full set of genes that were returned by
DESeq2. This set should have filtered out genes that have
low counts.
We can use the same function from earlier to convert our list of Pelobates IDs to Xenopus peptide IDs.
# function to pull out xtr background IDs
extract_xtr<-function(x) {
return(
xtrop %>%
filter(gene_id %in% x) %>%
pull(xenp_pep_id) %>%
str_split(pattern=";") %>%
unlist() %>%
na.omit() %>%
unique()
)
}
xtr_bg<-lapply(res, FUN= function(x) extract_xtr(rownames(x)))
str(xtr_bg)
## List of 2
## $ central: chr [1:15183] "ENSXETP00000113033" "ENSXETP00000042154" "ENSXETP00000095923" "ENSXETP00000041999" ...
## $ south : chr [1:15183] "ENSXETP00000113033" "ENSXETP00000042154" "ENSXETP00000095923" "ENSXETP00000041999" ...
# lets unify the background to use the same across all populations
xtr_bg_all<-xtr_bg %>%
unlist() %>%
unique()
We will use g:Profiler tool, which is a web-based tool for functional enrichment analysis of gene lists.
The associated R package for g:Profiler is gprofiler2, which is an API (Application Programming Interface). This API serves as a bridge that allows performing the actual analysis within the g:Profiler server, by using the R environment in RStudio.
g:Profiler is continuously being updated to match the updates in the associated Ensembl and Ensembl Genomes Data Bases. Because of this, it is important to specify which version of g:Profiler, and therefore Ensembl version, we would like to use. We will do this by looking at the archives on the g:Profiler homepage and setting the base url for the desired version.
Our annotations resulted from BLASTing P. cultripes transcriptome against the 109 Ensembl release of the X. tropicalis proteome, therefore we will use Ensembl 109, Ensembl Genomes 56 (database built on 2023-03-29).
# Install and load the gprofiler2 package. This package provides functions to access the gprofiler API directly from R.
# install.packages("gprofiler2")
library(gprofiler2)
# Setting the base url
# set base url:
set_base_url("https://biit.cs.ut.ee/gprofiler_archive3/e109_eg56_p17/")
# run the analysis
res_ora<-gost(multi_query = FALSE, # returns separate results tables for multiquery
custom_bg = xtr_bg_all, # our background
query=sig_deg, # our list of gene sets
organism="xtropicalis", # the organism our annotations belong to
exclude_iea = FALSE, # include GO terms that were electronically assigned
correction_method = "gSCS", # the recommended multiple testing correction.
sources=c("GO:BP","GO:CC","GO:MF", "KEGG","REAC"), # the functional sets we are interested in
evcodes=FALSE, # evcodes TRUE needed for downstream analysis like enrichment maps in Cytoscape, but it takes longer
significant= FALSE) # return all terms, not just the significant ones
## Detected custom background input, domain scope is set to 'custom'.
# The results are stored as a "results" dataframe
head(res_ora$result)
## query significant p_value term_size query_size intersection_size
## 1 central TRUE 0.001855428 4558 100 60
## 2 central TRUE 0.002466472 3863 100 54
## 3 central TRUE 0.002476485 4221 100 57
## 4 central FALSE 0.373583457 1244 100 24
## 5 central FALSE 0.450705336 8450 100 82
## 6 central FALSE 0.675368296 1200 100 23
## precision recall term_id source term_name
## 1 0.60 0.013163668 GO:0065007 GO:BP biological regulation
## 2 0.54 0.013978773 GO:0050794 GO:BP regulation of cellular process
## 3 0.57 0.013503909 GO:0050789 GO:BP regulation of biological process
## 4 0.24 0.019292605 GO:0009889 GO:BP regulation of biosynthetic process
## 5 0.82 0.009704142 GO:0009987 GO:BP cellular process
## 6 0.23 0.019166667 GO:0051252 GO:BP regulation of RNA metabolic process
## effective_domain_size source_order parents
## 1 13686 16811 GO:0008150
## 2 13686 14008 GO:00099....
## 3 13686 14004 GO:00081....
## 4 13686 3827 GO:00090....
## 5 13686 3889 GO:0008150
## 6 13686 14368 GO:00160....
Let’s look at the enrichment results in detail
colnames(res_ora$result)
## [1] "query" "significant" "p_value"
## [4] "term_size" "query_size" "intersection_size"
## [7] "precision" "recall" "term_id"
## [10] "source" "term_name" "effective_domain_size"
## [13] "source_order" "parents"
query: This column represents the gene set or query submitted to g:Profiler for enrichment analysis. Each row corresponds to a different gene set or query.
significant: This column indicates whether the enrichment analysis identified the term (functional category) as statistically significant. It contains a boolean value (TRUE or FALSE).
p_value: This column contains the adjusted p-values. Each p-value indicates the statistical significance of enrichment of a given term in the gene set.
term_size: This column represents the total number of genes in a specific term (functional category).
query_size: This column represents the total number of genes in the submitted query or gene set.
intersection_size: This column represents the number of genes that are shared between the submitted query and the term being analyzed. That is, how many genes from both term size and query size are overlapping.
term_id: This column contains the unique identifier (ID) associated with the term (functional category) being analyzed.
term_name: This column contains the name or description of the term (functional category) being analyzed.
source: This column indicates the data source or database from which the term originates and that we requested (Gene Ontology, KEGG, Reactome).
precision: This column represents the proportion of the genes shared between the submitted query and the term being analyzed, relative to the total number of genes in the query. It measures the accuracy of the enrichment analysis. A higher precision value indicates that a larger proportion of the genes in the intersection are relevant to the term being analyzed, relative to the total number of genes in the query. In other words, it reflects how well the term represents the genes in the query.
recall: This column represents the proportion of the genes in the intersection, the same as the precision column, but relative to the total number of genes in the term. It reflects how well the term represents the genes in the dataset or background.
effective_domain_size: This column represents the effective domain size used in the enrichment analysis. It is a statistical parameter used to calculate the p-value.
source_order: This column represents the order or rank of the source (data database) based on its importance or relevance in the enrichment analysis.
parents: This column contains information about the parent terms or higher-level categories to which the analyzed term belongs. It may provide additional hierarchical context for the analyzed terms.
We can use a p-value cutoff of 0.05 to see how many terms have been functionally enriched in each gene set.
res_ora$result %>%
filter(p_value<0.05) %>%
group_by(query) %>%
dplyr::count(query, sort=TRUE)
## # A tibble: 2 × 2
## # Groups: query [2]
## query n
## <chr> <int>
## 1 south 12
## 2 central 9
res_ora$result %>%
filter(p_value<0.05) %>%
group_by(query)
## # A tibble: 21 × 14
## # Groups: query [2]
## query significant p_value term_size query_size intersection_size precision
## <chr> <lgl> <dbl> <int> <int> <int> <dbl>
## 1 central TRUE 1.86e-3 4558 100 60 0.6
## 2 central TRUE 2.47e-3 3863 100 54 0.54
## 3 central TRUE 2.48e-3 4221 100 57 0.57
## 4 central TRUE 6.62e-5 79 100 7 0.07
## 5 central TRUE 4.16e-4 242 100 10 0.1
## 6 central TRUE 9.11e-3 119 100 6 0.06
## 7 central TRUE 3.51e-2 59 100 4 0.04
## 8 central TRUE 3.98e-2 61 100 4 0.04
## 9 central TRUE 4.91e-2 110 100 5 0.05
## 10 south TRUE 4.82e-2 818 985 96 0.0975
## # ℹ 11 more rows
## # ℹ 7 more variables: recall <dbl>, term_id <chr>, source <chr>,
## # term_name <chr>, effective_domain_size <int>, source_order <int>,
## # parents <list>
gostplot(res_ora)
Custom dot plot using the gprofiler results tables and ggplot.
custom_query_names <- res_ora$result %>%
mutate(query = case_when(
query == "central" ~ "Less plastic",
query == "south" ~ "Plastic",
TRUE ~ query
))
custom_query_names %>%
select(query,term_name, p_value, intersection_size, query_size,source) %>%
filter(p_value<0.05) %>%
mutate(GeneRatio=intersection_size/query_size) %>%
arrange(GeneRatio) %>%
mutate(term_name = factor(term_name, levels=unique(term_name))) %>%
ggplot(aes(x=GeneRatio, y=term_name)) +
geom_point(aes(color=p_value, size=intersection_size)) +
ylab("") +
# scale_color_scico(palette = "bamako", direction = 1) +
facet_grid(source~query,scales = "free_y",space = "free") +
theme_bw()
ggsave("functional_enrich.jpeg", width = 8, height = 8.5, dpi = 600)